tight nonparametric convergence rate
Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model
In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation $Y = \langle \theta_*, \Phi(U) \rangle$ between the random output $Y$ and the random feature vector $\Phi(U)$, a potentially non-linear transformation of the inputs~$U$. We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-square risk under this model. The convergence of the iterates to the optimum $\theta_*$ and the decay of the generalization error follow polynomial convergence rates with exponents that both depend on the regularities of the optimum $\theta_*$ and of the feature vectors $\Phi(U)$. We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit hypercube from the noiseless observation of its value at randomly sampled points; the convergence depends on the Sobolev smoothness of the function and of a chosen kernel. Finally, we apply our analysis beyond the supervised learning setting to obtain convergence rates for the averaging process (a.k.a.
Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model
In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation Y \langle \theta_*, \Phi(U) \rangle between the random output Y and the random feature vector \Phi(U), a potentially non-linear transformation of the inputs U . We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-square risk under this model. The convergence of the iterates to the optimum \theta_* and the decay of the generalization error follow polynomial convergence rates with exponents that both depend on the regularities of the optimum \theta_* and of the feature vectors \Phi(U) . We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit hypercube from the noiseless observation of its value at randomly sampled points; the convergence depends on the Sobolev smoothness of the function and of a chosen kernel.
Review for NeurIPS paper: Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model
Weaknesses: While the paper is written very clearly, there are several questions I'd like to raise. Firstly, in discussing the applicability of the results the paper mentions'some basic vision or sound recognition tasks' (line 33) - I'd like to ask about some examples of such tasks. Looking at the statement of the Theorem 1, seems that it should be applicable in finite-dimensional spaces with invertible covariance matrices. If it is so, then I do not understand the results. In particular, for X distributed with a finite support and has identity covariance matrix, the conditions (a) and (b) hold for arbitrarily large positive \alpha, however the theorem statement implies that the estimates will go to zero at an arbitrarily large polynomial rate, which is not true.
Review for NeurIPS paper: Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model
The submission considers noiseless (/low noise) linear regression with non-linear transformation of the input data, and show that under this setting, SGD achieves faster convergence rates. This is a very nice contribution with applicability to important problems as mentioned by the authors in their feedback. We urge the authors to incorporate the points they made in response to the reviews.
Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model
In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation Y \langle \theta_*, \Phi(U) \rangle between the random output Y and the random feature vector \Phi(U), a potentially non-linear transformation of the inputs U . We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-square risk under this model. The convergence of the iterates to the optimum \theta_* and the decay of the generalization error follow polynomial convergence rates with exponents that both depend on the regularities of the optimum \theta_* and of the feature vectors \Phi(U) . We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit hypercube from the noiseless observation of its value at randomly sampled points; the convergence depends on the Sobolev smoothness of the function and of a chosen kernel.